405 research outputs found

    Towards an Information Retrieval Theory of Everything

    I present three well-known probabilistic models of information retrieval in tutorial style: the binary independence probabilistic model, the language modeling approach, and Google's PageRank. Although all three models are based on probability theory, they are very different in nature. Each model seems well suited for solving certain information retrieval problems, but not so useful for solving others. So, essentially, each model solves part of a bigger puzzle, and a unified view on these models might be a first step towards an Information Retrieval Theory of Everything.
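
    As a concrete taste of the third model, the sketch below computes PageRank by power iteration over a made-up four-page link graph; the graph and parameter values are illustrative assumptions, not material from the paper.

    ```python
    # PageRank by power iteration: a minimal sketch on a toy link graph.
    # The graph and damping factor are made-up examples.
    import numpy as np

    def pagerank(links, damping=0.85, iterations=50):
        """links[i] lists the pages that page i links to."""
        n = len(links)
        rank = np.full(n, 1.0 / n)  # start from the uniform distribution
        for _ in range(iterations):
            new_rank = np.full(n, (1.0 - damping) / n)  # random-jump mass
            for page, outlinks in enumerate(links):
                if outlinks:  # spread this page's rank over its outlinks
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
                else:  # dangling page: spread its rank uniformly
                    new_rank += damping * rank[page] / n
            rank = new_rank
        return rank

    # Page 0 links to 1 and 2; page 1 to 2; page 2 back to 0; page 3 to 2.
    print(pagerank([[1, 2], [2], [0], [2]]))
    ```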

    Probability smoothing

    Contains fulltext: 227633.pdf (publisher's version, Open Access).

    Multilingual domain modeling in Twenty-One: automatic creation of a bi-directional translation lexicon from a parallel corpus

    Within the project Twenty-One, which aims at the effective dissemination of information on ecology and sustainable development, a system is developed that supports cross-language information retrieval in any of four languages: Dutch, English, French, and German. Knowledge of this application domain is needed to enhance existing translation resources for the purpose of lexical disambiguation. This paper describes an algorithm for the automated acquisition of a translation lexicon from a parallel corpus. What is new about the presented algorithm is the statistical language model it uses. Because the algorithm is based on a symmetric translation model, it becomes possible to identify one-to-many and many-to-one relations between the words of a language pair. We claim that the presented method has two advantages over previously published algorithms. Firstly, because the translation model is more powerful, the resulting bilingual lexicon is more accurate. Secondly, the resulting bilingual lexicon can be used to translate in both directions between the languages of a pair. Different versions of the algorithm were evaluated on the Dutch and English versions of the Agenda 21 corpus, a UN document on the application domain of sustainable development.
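
    A much-simplified illustration of statistical lexicon induction from sentence-aligned text is sketched below. It follows the directional IBM Model 1 EM recipe, so it shows the general idea only, not the paper's symmetric model; the toy sentence pairs are made-up examples.

    ```python
    # EM lexicon induction in the style of IBM Model 1 (directional, unlike
    # the paper's symmetric model). Toy data; illustrative only.
    from collections import defaultdict

    def train_lexicon(sentence_pairs, iterations=10):
        """sentence_pairs: list of (source_tokens, target_tokens) pairs."""
        prob = defaultdict(lambda: 1.0)  # uniform start for t(target | source)
        for _ in range(iterations):
            counts = defaultdict(float)  # expected counts c(source, target)
            totals = defaultdict(float)  # expected counts c(source)
            for src, tgt in sentence_pairs:
                for t in tgt:
                    norm = sum(prob[(s, t)] for s in src)
                    for s in src:
                        c = prob[(s, t)] / norm  # soft alignment of t to s
                        counts[(s, t)] += c
                        totals[s] += c
            prob = defaultdict(float, {(s, t): c / totals[s]
                                       for (s, t), c in counts.items()})
        return prob

    pairs = [("het huis".split(), "the house".split()),
             ("het boek".split(), "the book".split())]
    lexicon = train_lexicon(pairs)
    print(sorted(lexicon.items(), key=lambda kv: -kv[1])[:4])
    ```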

    Information Retrieval Models

    Many applications that handle information on the internet would be completely inadequate without the support of information retrieval technology. How would we find information on the world wide web if there were no web search engines? How would we manage our email without spam filtering? Much of the development of information retrieval technology, such as web search engines and spam filters, requires a combination of experimentation and theory. Experimentation and rigorous empirical testing are needed to keep up with increasing volumes of web pages and emails. Furthermore, experimentation and constant adaptation of technology are needed in practice to counteract the effects of people who deliberately try to manipulate the technology, such as email spammers. However, if experimentation is not guided by theory, engineering becomes trial and error. New problems and challenges for information retrieval come up constantly. They cannot possibly be solved by trial and error alone. So, what is the theory of information retrieval? There is not one convincing answer to this question. There are many theories, here called formal models, and each model is helpful for the development of some information retrieval tools, but not so helpful for the development of others. In order to understand information retrieval, it is essential to learn about these retrieval models. In this chapter, some of the most important retrieval models are gathered and explained in a tutorial style.
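
    As one example of such a formal model, the sketch below ranks toy documents by query likelihood under a unigram language model with Jelinek-Mercer smoothing; the documents and the smoothing parameter are made-up illustrations, not taken from the chapter.

    ```python
    # Query-likelihood retrieval with Jelinek-Mercer smoothing: a toy
    # sketch of one classic formal model. Data and lambda are made up.
    import math
    from collections import Counter

    def score(query, doc, collection, lam=0.5):
        """log P(query | doc): sum over query terms of
        log((1 - lam) * P(term | doc) + lam * P(term | collection))."""
        doc_tf, col_tf = Counter(doc), Counter(collection)
        s = 0.0
        for term in query:
            p_doc = doc_tf[term] / len(doc)         # document model
            p_col = col_tf[term] / len(collection)  # background model
            s += math.log((1 - lam) * p_doc + lam * p_col)
        return s

    docs = ["information retrieval theory".split(),
            "spam filtering for email".split()]
    collection = [term for d in docs for term in d]
    query = "retrieval theory".split()
    best = max(docs, key=lambda d: score(query, d, collection))
    print(best)  # the first document scores highest for this query
    ```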

    University of Twente @ TREC 2009: Indexing half a billion web pages

    This report presents results for the TREC 2009 adhoc task, the diversity task, and the relevance feedback task. We present ideas for unsupervised tuning of search systems, an approach for spam removal, and the use of categories and query log information for diversifying search results.
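
    For a flavour of what diversification means in practice, here is a generic MMR-style reranking sketch; it illustrates the idea only and is not the method described in the report. The relevance and similarity functions are hypothetical placeholders.

    ```python
    # MMR-style diversification: greedily pick results that balance
    # relevance against similarity to what is already selected.
    # Illustrative only; not the TREC report's method.
    def diversify(results, relevance, similarity, k=10, trade_off=0.5):
        """results: ids; relevance: id -> float; similarity: (id, id) -> float."""
        selected = []
        candidates = set(results)
        while candidates and len(selected) < k:
            def mmr(r):
                max_sim = max((similarity(r, s) for s in selected), default=0.0)
                return trade_off * relevance[r] - (1 - trade_off) * max_sim
            best = max(candidates, key=mmr)
            selected.append(best)
            candidates.remove(best)
        return selected

    rel = {"a": 0.9, "b": 0.85, "c": 0.4}
    sim = lambda x, y: 1.0 if {x, y} == {"a", "b"} else 0.0  # a, b near-duplicates
    print(diversify(["a", "b", "c"], rel, sim, k=2))  # ['a', 'c']
    ```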

    Brute Force Information Retrieval Experiments using MapReduce

    MIREX (MapReduce Information Retrieval Experiments) is a software library, initially developed by the Database Group of the University of Twente, for running large-scale information retrieval experiments on clusters of machines. MIREX has been tested on web crawls of up to half a billion web pages, totalling about 12.5 TB of uncompressed data. MIREX shows that executing test queries by a brute-force linear scan of the pages is a viable alternative to running the test queries on a search engine's inverted index. MIREX is open source and available to others.
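
    The brute-force idea can be mimicked on a single machine: a "mapper" scans every document and emits partial scores, and a "reducer" sums them per query-document pair. MIREX itself does this on a cluster; the scoring function and data below are illustrative assumptions.

    ```python
    # Single-machine sketch of brute-force retrieval in MapReduce style.
    # Toy queries, documents, and term-match scoring; illustrative only.
    from collections import defaultdict

    queries = {"q1": {"information", "retrieval"}, "q2": {"spam"}}

    def mapper(doc_id, text):
        terms = text.lower().split()
        for qid, qterms in queries.items():
            matches = sum(1 for t in terms if t in qterms)
            if matches:
                yield (qid, doc_id), matches  # partial score: match count

    def reducer(pairs):
        scores = defaultdict(int)
        for key, value in pairs:
            scores[key] += value  # sum partial scores per (query, doc)
        return scores

    docs = [("d1", "Information retrieval by linear scan"),
            ("d2", "Spam filtering and more spam")]
    emitted = [kv for doc_id, text in docs for kv in mapper(doc_id, text)]
    print(dict(reducer(emitted)))
    ```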

    A Simulator for Concept Detector Output

    Concept-based video retrieval is a promising search paradigm because it is fully automated and it investigates the fine-grained content of a video, which is normally not captured by human annotations. Concepts are captured by so-called concept detectors. However, since these detectors do not yet show sufficient performance, the evaluation of retrieval systems that are built on top of the detector output is difficult. In this report we describe a software package which generates simulated detector output for a specified performance level. Afterwards, this output can be used to execute a search run and, ultimately, to evaluate the performance of the proposed retrieval method, which is normally done through comparison to a baseline. The probabilistic model of the detectors consists of two Gaussians, one for the positive and one for the negative class. Thus, the parameters for the simulation are the two means and standard deviations, plus the prior probability of the concept in the dataset.
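
    A simplified sketch of that generative model is given below: each simulated detector score is drawn from the positive-class or negative-class Gaussian, chosen according to the concept's prior. The parameter values are made-up examples, not taken from the report.

    ```python
    # Simulated concept-detector output: two Gaussians plus a class prior.
    # Parameter values below are made-up examples.
    import numpy as np

    def simulate_detector(n_shots, prior, mu_pos, sigma_pos,
                          mu_neg, sigma_neg, seed=0):
        rng = np.random.default_rng(seed)
        is_positive = rng.random(n_shots) < prior  # ground-truth labels
        scores = np.where(is_positive,
                          rng.normal(mu_pos, sigma_pos, n_shots),
                          rng.normal(mu_neg, sigma_neg, n_shots))
        return scores, is_positive

    scores, labels = simulate_detector(
        n_shots=1000, prior=0.05,    # rare concept
        mu_pos=1.0, sigma_pos=0.5,   # positive-class score distribution
        mu_neg=-1.0, sigma_neg=0.5)  # negative-class score distribution
    print(scores[:5], labels[:5])
    ```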

    MapReduce for information retrieval evaluation: "Let's quickly test this on 12 TB of data"

    We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net

    MIREX: MapReduce Information Retrieval Experiments

    We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low-cost machines to search a web crawl of 0.5 billion pages, showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort. The code is available to other researchers at: http://mirex.sourceforge.net